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Abstract 
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\^ , determining the update to a block of variables. Existing algorithms assume that in order to 

r^ ' compute the update, a particular subproblem is solved exactly. In his work wc relax this re- 

quirement, and allow for the subproblem to be solved inexactly, leading to an inexact block 
coordinate descent method. Our approach incorporates the best known results for exact up- 
dates as a special case. Moreover, these theoretical guarantees are complemented by practical 
considerations: the use of iterative techniques to determine the update as well as the use of 
preconditioning for further acceleration. 



In this paper we consider the problem of minimizing a convex function using a randomized 
block coordinate descent method. One of the key steps at each iteration of the algorithm is 
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_J ■ 1 Introduction 

Due to a dramatic increase in the size of optimization problems being encountered, first order 
p\ . methods are becoming increasingly popular. These large-scale problems are often highly structured 

S I and it is important for any optimization method to take advantage of the underlying structure. 

Applications where such problems arise and where first order methods proved successful include 
machine learning [161 132] , compressive sensing [TJ 03] , group lasso [251 ES] , matrix completion [H [26] , 
truss topology design [27] . 

Block coordinate descent methods seem a natural choice for these very large-scale problems 
due to their low memory requirements and low per-iteration computational cost. Furthermore, 
they are often designed to take advantage of the underlying structure of the optimization problem 
[4HI42| and many of these algorithms are supported by high probability iteration complexity results 
[231 [Ml [271 [281 [M]. 
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1.1 The Problem 

If the block size is larger than one, determining the update to use at a particular iteration in a 
block coordinate descent method can be computationally expensive. The purpose of this work is 
to reduce the cost of this step. To achieve this, we extend the work in [28] to include the case of 
an inexact update. 

In this work we study randomized block coordinate descent methods applied to the problem 
of minimizing a composite objective function. That is, a function formed as the sum of a smooth 
convex and a simple nonsmooth convex term: 

mm{F{x) := f{x) + '^{x)}. (1) 



We assume that the problem has a minimum (F* > — oo), / has (block) coordinate Lipschitz 
gradient, and ^ is a (block) separable proper closed convex extended real valued function (all these 
concepts will be defined precisely in Section [2]) . 

Our algorithm (namely, the Inexact Coordinate Descent (ICD) method) is supported by high 
probability iteration complexity results. That is, confidence level p G (0, 1) and error tolerance 
e > 0, we give an explicit expression for the number of iterations k that guarantee that the method 
produces a random iterate Xk for which 

nF{xk) -F* <e)>l-p. 

We will show that in the inexact case it is not always possible to achieve a solution with small error 
and/or high confidence. 

Our theoretical guarantees are complemented by practical considerations. Because an inexact 
update is allowed, it is sensible to use iterative methods to solve the subproblems; this can be espe- 
cially beneficial for quadratic minimization. This motivates us to study the smooth quadratic case 
where the problem exhibits block angular structure, and to solve the subproblems using conjugate 
gradients [12j . The benefits of preconditioning the conjugate gradients method are well known, 
so we introduce and analyze a preconditioner that can be used to speed up the update step, and 
ultimately reduce the overall ICD algorithm running time. Finally, we present some encouraging 
computational results. 

1.2 Literature Review 

As problem sizes increase, first order methods are benefiting from revived interest. On very large 
problems however, the computation of a single gradient step is expensive, and methods are needed 
which would be able to make progress before a standard gradient algorithm takes a single step. For 
instance, a randomized variant of the Kaczmarz method for solving linear systems has recently been 
studied, equipped with iteration complexity bounds [201 [211 [151 [37] , and found surprisingly efficient. 
This method can be seen as a special case of a more general class of decomposition algorithms, block 
coordinate descent methods, which have recently gained much popularity [T9 l [23 l [M l l28 l l29 l [30 1 HO] . 
One of the main differences between various (serial) coordinate descent schemes is the way in which 
the coordinate is chosen at each iteration. Traditionally cyclic schemes j31j and greedy schemes 
[27] were studied. More recently, a popular alternative is to select coordinates randomly, because 
the coordinate can be selected cheaply, and useful iteration complexity results can be obtained 
[ig [281 [291 [301 [391 [M]. 



Another current trend in this area is to consider methods that incorporate some kind of 'inexact- 
ness', perhaps using approximate gradients, or using inexact updates. For example, [17] considers 
methods based on inexact dual gradient information, while [32] considers the minimization of an 
unconstrained convex composite function where error is present in the gradient of the smooth term, 
or in the proximity operator for the non-smooth term. Other works study methods that use in- 
exact updates when the objective function is convex, smooth and unconstrained [T], smooth and 
constrained [3] or for ^i-regularized quadratic least squares problem 



1.3 Contribution 

In this paper we extend the work of Richtarik and Takac [28] and present a block coordinate descent 
method that employs inexact updates having the potential to reduce the overall algorithm running 
time. Furthermore, we focus in detail on the quadratic case, which benefits greatly from inexact 
updates, and show how preconditioning can be used to complement the inexact update strategy. 
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Table 1: Comparison of the iteration complexity results for coordinate descent methods using an 
inexact update and using an exact update (C=Convex, SC=Strongly Convex, N=Nonsmooth, S = 
Smooth). 

Table [1] compares some of the new complexity results obtained in this paper for an inexact 
update with the complexity results for an exact update presented in [28]. The following notation is 
used in the table: by jjl^ we denote the strong convexity parameter of function (p (with respect to a 
certain norm specified later), ^ = (/u/ -|- /i*)/(l + ^^) and TZw{xq) can be roughly considered to be 
distance from xq to a solution of ([1]) measured in a specific weighted norm parameterized by the 
vector w (to be defined precisely in (fT^ l. The constants are ci = 2nmax{7^^(xo), i*'(xo) — i^*}, ci = 



2TZ^{xo) and C2 = 2nTZ^{xo)/e, and n is the number of blocks. Parameters a,(3 > control the 
level of inexactness (to be defined precisely in Section [32]). 

Table [1] shows that for fixed e and p, an inexact method will require more iterations than an 
exact one. However, it is expected that in certain situations an inexact update will be significantly 
cheaper to compute than an exact update, leading to better overall running time. Moreover, the 
new complexity results for the inexact method generalize those for the exact method. Specifically, 
for inexactness parameters a = /3 = we recover the complexity results in |28| . 

1.4 Outline 

The first part of this paper focuses on the theoretical aspects of a block coordinate descent method 
when an inexact update is employed. In Section [2] the assumptions and notation are laid out and in 
Section [3] the ICD method is presented. In Section [4] iteration complexity results for ICD applied 
to ([1]) are presented in both the convex and strongly convex cases. Iteration complexity results for 
ICD applied to a convex smooth minimization problem (^I* = in ([T])) are presented in Section [U 
in both the convex and strongly convex cases. 

The second part of the paper considers the practicality of an inexact update. Section [6] describes 
how the ICD method can be effectively implemented in the quadratic case when block angular 
structure is present. The update step is computed using conjugate gradients and we describe a 
preconditioner for the update step. Section [7] provides a detailed analysis of the spectrum of the 
preconditioned matrix, numerical experiments are presented in Section [8] and concluding remarks 
are given in Section [9l 

2 Assumptions and Notation 

In this section we introduce the notation and definitions that are used throughout the paper. 



2.1 Block structure of 



9 TV 



The problem under consideration is assumed to have block structure and this is modelled by 
decomposing the space M into n subspaces as follows. Let U G M be a column permutation 

of the N X N identity matrix and further let U = [Ui, U2, ■ ■ ■ , Un] be a decomposition of U into n 
submatrices, where Ui is N x Ni and XliLi -^« ~ ^ • ^^ ^^ clear (e.g., see [29] for a brief proof) that 
any vector x G M can be written uniquely as 

n 

X = Y,U^X^'\ (2) 

i=l 

where x^*' G Mj = M.^\ Moreover, these vectors are given by 

x« := U^x. (3) 

For simplicity we will sometimes write x = {x^^' ,x^^', . . . ,x^"'') instead of ([2]). We equip Mj with a 
pair of conjugate Euclidean norms: 

||i||(i) :=(5,t,t)i \\t\\l^^ = {B^H,t)K iGM„ (4) 

where Bi G M.^^ ' is positive definite and (•, •) is the standard Euclidean dot product. 



2.2 Smoothness of / 

Throughout this paper we assume that the gradient of / is block Lipschitz, uniformly in x, with 
positive constants li, . . . , /„. This means that, for all x G M^, i € {1, 2, . . . , n} and i G Mj we have 

\\Vd{x + Uit)-Wif{x)\\l^^<k\\t\\(i}, (5) 

where 

S/J{x) := (V/(x))» i UlVf{x) G M,. (6) 

An important consequence of ([5]) is the following standard inequality |22l p. 57]: 

fix + Uit) < fix) + (V,/(x), t) + ||Ii||2^). (7) 

2.3 Block separability of \]/ 

The function ^ : R^ — t- M U {+00} is assumed to be block separable. That is, we assume that it 
can be decomposed as: 

n 

^(x) = ^^,(x«), (8) 

where the functions ^j : Mj — >■ M U {+00} are convex and closed. 



2.4 Norms on 



t>N 



For fixed positive scalars wi,W2, ■ ■ ■ ,Wn,let w = (wi, . . . , Wn) and define a pair of conjugate norms 
in M^ by 



U*^ 1/1 



E^^II^^'^IIW' (lly|i;)':=„max(y,x)2 = ^^ri(||y»||*^))2. (9) 

-I R' h^ 1 ■ -1 

1=1 ~ 1=1 



In the subsequent analysis we will use w = I (for "^ ^ 0) and w = lp~^ (for ^ = 0), where 
/ = (^1, . . . , Zn) is a vector of Lipschitz constants, p = (pi, . . . ,Pn) is a vector of positive probabilities 
and lp~^ denotes the vector ih/pi, . . . , In/Pn)- 

2.5 Strong convexity of F 

In some of the results presented in this work we assume that F is strongly convex with respect to 
the norm || • ||^ for some w, with (strong) convexity parameter //_p(w) > 0. A function (j) : M — )> 
RU{+oo} is strongly convex w.r.t. || • ||^ with convexity parameter n^iw) > if for all x, y G dom(/), 

Hy) > 4>ix) + {<p'ix),y -x) + i^\\y- x\\l, (10) 

where (p' is any subgradient of (j) at x. The case with Hipiw) = reduces to convexity. 

Strong convexity of F may come from / or ^ or both and we will write fJ-fiw) (resp. fiqiiw)) 
for the strong convexity parameter of / (resp. ^). Following from (llOp 

fipiw) > Hfiw) + ^f$(w). (11) 



Using d?!) and ([TOj) it can be shown that 

///(/) < 1, and fif{lp~'^) < 1. (12) 

We will also make use of the following characterisation of strong convexity. For all x,y € dom0 
and AG [0,1], 

<P{\x + (1 - \)y) < X<Pix) + (1 - A)0(y) - ^^^"^f ""^ ||x - y\\l. (13) 

2.6 Level set radius 

The set of optimal solutions of ([T]) is denoted by X* and x* is any element of that set. We define 

TZwix) := max max {\\y - x*\\w : F{y) < F{x)}, (14) 

y x'gX* 

which is a measure of the size of the level set of F given by x. We assume that TZu,{xo) is finite for 
the initial iterate xq. 

3 The Algorithm 

Let us start by presenting the algorithm; a more detailed description will follow. 
Algorithm 1 ICD: Inexact Coordinate Descent 



for k = 0,1,2,... do 

Choose 5k e W such that J2iPi^k^ < a{F{xk) - F*) + p 
Choose block z€{l,2,...,n} with probability pj > 
Compute the inexact update Tf {xk) to block i of Xk 



2 
3 

4 

5: Update block i of x^: x^+i = ^k~\~ UiTc^ (xk) 
6: end for 



3.1 Generic description 

Given iterate x^ € M^, Algorithm [T] picks block i G {1,2, ...,n} with probability pi, computes 
the vector Tj (x^) G Mj and then adds it to the ith. block of Xk, producing the new iterate Xk+i- 
The iterates {xk} are random vectors and the values {F{xk)} are random variables. The update 
vector depends on Xk, the current iterate, and on 6k, a vector of parameters controlling the "level 
of inexactness" with which the update is computed. The rest of this section is devoted to giving a 
precise definition of Tg{xk). Note that from ([T|) and ^ we have, for all x € M^, i € {1, 2, . . . ,n} 
and t G Mj: 

F{x + U,t) = fix + Uit) + ^{x + Uit) < fix) + Viix, t) + ^,(x), (15) 

where 

Viix, t) := {VJix),t) + I \\t\\l^ + ^,(x« + t), (16) 

i;,ix):=Y,^^ix^^^). (17) 



That is, ([15]) gives an upper bound on F{x + Uit), viewed as a function of t G Mj. 

The inexact update apphed in step 4 of Algorithm [1] is the inexact minimizer of the upper 
bound (|15p on F{xk + Uit) (to be defined precisely below). However, since only the second term of 
this bound depends on t, the update is computed by minimizing, inexactly, Vi(x,t) in t. 

3.2 Inexact update 

The approach of this paper best applies to situations in which it is much easier to approxi- 
mately minimize t i— )■ Vi{x,t) than to both (i) approximately minimize t i— > F{x + Uit) and (ii) 
exactly minimize t i— t- Vi{x,t). For x G M anqU 6 = {6^^' , ... ,6^^') > we define Ts{x) = 
(Tj (x), . . . , Tg (x)) G M^ to be any vector satisfying 

Vi{x,T^\x)) < mm \Vi{x, 0), 5^''> + mm Vi{x,t)\ , i = l,...,n. (18) 

That is, we require that the update Tj (x) of the ith block of x is (i) no worse than a nil update, 

and that it is (ii) close to the optimal update Tq (x) = arg mint Viix, t), where the degree of subop- 
timality/inexactness is bounded by 6^^' . In particular, we consider, and give iteration complexity 
results for, the situation where at iteration k of the ICD method we choose 6k = (5^ , ■ ■ ■ , ^k ) so 
that the expected suboptimality is bounded above by a linear function of the residual F{xk) — F* , 
i.e., 

n 

4 := Y^Pi^k^ ^ c^in^k) - F*) + /3, (19) 

(i\ 

where a and /3 are nonnegative constants. Note that, for instance, ()19l) holds if we require 5^. < 
a{F{xk) - F*) + p ioT all i. 

As the following lemma shows, the update (llSp leads to a monotonic algorithm. 

Lemma 1. For all x € M^, 5 G M" and i G {1, 2, . . . , n}, 

F{x + UiT^'\x)) < F{x). (20) 

Proof: 

F{x + UiTf\x)) < f{x) + Vi{x,T^'\x))+iJi{x) 

? fix) + V,ix,0)+M^)^^^F{x). D 

3.3 Technical result 

The following result plays a key role in the complexity analysis of ICD. 

Theorem 2. Fix xq G M^ and let {xk}k>o be a sequence of random vectors in M.^ with x^+i 
depending on x^ only. Let if : M -^ M. be a nonnegative function, define ^k •= V'(^fc) '^'^'^ assume 
that {(,k}k>o is nonincreasing. Further, let p G (0, 1), e > and a, (3 > be such that one of the 
following two conditions holds: 



^We allow here for an abuse of notation {5^'' is a scalar, rather than a vector in R; as x^'' for x G R^) as we wish 
to emphasize that the scalar 5'*' is associated with the i-th block. 



(i) E[Cfc+i I Xk] < (1 + a)ik -± + P, for all A; > 0, 



where ci > 0, ^ (a + J a^ + c^ ) < ^ < '^o and a := wa^ + -^ < 1; 

(n) E[^fc+i I Xfc] < n + a - i j ^fc + /3, /or all k>0 for which ^k > e, 
where C2 > 1, ac2 < 1 and p(^i^_^ac2) < ^ < ^0- 
If (i) holds and we define u := ^(a + (t) and choos^ 

( f /3ci \ ( ^ / c \ 



i^> ^log ^:|i^ +min -logU^^ 1 ^ _ ^^ + 2, (21) 



or i/ (ii) /loZds and we choose 



C2 , / ^0 - T 



/3c2 



^ > , log ^^^ i^^ , (22) 

t/ien P(^i^ < e) > 1 - p. 

Proof. First notice that the thresholded sequence {C^.}fc>o defined by 



0, if Cfe < e, 
^fc, otherwise, 



(23) 



satisfies ^^ > e <^ ^^ > e. Therefore, by Markov's inequality, ¥{^k > e) = ^{^l > ^) ^ ~T^^- Letting 
0fc := IE[^^], it thus suffices to show that 

0K < ep. (24) 

The rationale behind this "thresholding trick" is that the sequence E[|fc] decreases faster than E[^fc] 
and hence will reach ep sooner. Assume now that (i) holds. It can be shown (for example, see 
Theorem 1 of [25j for the case a = /3 = 0) that 

E[a+i I Xk] < (1 + a)Ck -^f + (3, E[a+i I X,] < (l + a - ^) a + f3. (25) 
By taking expectations in ()25p (in x^) and using Jensen's inequality, we get 

0k+i < (1 + a)0fc - J + /3, A:>0, (26) 

ek+i < (l + a-^)efc + /3, k>0. (27) 

Notice that ()26p is better than ()27p precisely when 0^ > e. It is easy to see that the inequality 
(1 + a)9k — -r- + 13 < Ok holds if and only if 9^ > u. In other words, (i26l) leads to 6*^+1 that is better 



than 9k only for 0/^. > u. We will now compute k = ki for which u < Ok ^ e. Inequality (j26p can be 
equivalently written aqfl 

gfc+i-^<(l-^)(^fc-^)- ^^'~''^ . ^>0. (28) 

Cl 



^We ignore the first term in the minimum if cr = 0; or, equivalently, treat 1/a as oo 
We do this to eliminate the constant term /3; this allows us to provide a 
form leads to a better result; see the remarks after the Theorem for details 



We do this to eliminate the constant term /3; this allows us to provide a simple analysis. Moreover, this "shifted" 



where a < 1. Letting Ok := 6k — u, hy monotonicity we have 9k+i0k < ^^, whence 
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6k+i 6k 
If we choose r S {1, jr^}, then 

1 1 ' 
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Ci' 



(29) 



1 Oil 

-;r- > r 
6k 



+ — 1 >r 



^=1 +^yr= 

ci ^ 






1 

Cicr ' 



In particular, using the above estimate with r = 1 and r = j^ gives 

6ki < e — u (and hence dk^ < e) 

for 

ki := min < log I "7" ""'f 1 / log 

where the left term in (j3ip applies when a > only. 



1 

l-fT' 



1 
1-a 



e 



Cl 



^o-« 



(30) 
(31) 



Applying the inequalities (i) \t\ < 1 + t; (ii) log(Y^) > t (holds for < t < 1; we use the 
inverse version, which is surprisingly tight for small t); and (iii) the fact that t \-^ ^i| is decreasing 
on [0, oo) if C > Z) > 0, we arrive at the following bound 



ki < 1 + min { — lo, 



^0 - u\ Cl 



Cl 



a ' \ e — u J e ^q — u 
Letting 7 := 1 — ^~°'^^ (notice that 7 € (0, 1)), for any ^2 > we have 



(32) 



Cl 



1(13 



/3 A , /3 



1) 



1-7 



7''' I e 



1-7 



+ 



1-7- 



(33) 



In (j33p . notice that the second to last term can be made as small as we like (by taking k2 large), 
but we can never force ^fci+fej ^ j^- Therefore, in order to establish ([M|) . we need to ensure 



that 



^Cl 



- < ep. Rearranging this gives the condition ^(a + ya^ + ^) < e, which holds by 



assumption. Now we can find k2 for which the right hand side in ()33p is at most ep: 

i c - /3ci \ 



k2:-- 



log 



1-7 



e/9 



/3 
1-7. 



/log 



< 1 + 



Cl 



ac\ 



■log 



e — aci 



ep 



/3ci 



(34) 



where we have used the inequality t < log(Y:^) (which holds for t € [0, 1) and is a good approxi- 
mation for small t) with i = 1 — 7. 

In view of (j24p . it is enough to take K = ki + k2 iterations. The expression in ()2ip is obtained 
by adding the upper bounds on ki and /c2 in (|32p and 



Now assume that property (ii) holds. By a similar argument as that leading to (j25p . we obtain 

Ok^i + P < (l 



eK<[i-'-^ 



1— ac2 

C2 



K 



K-1 

0o + (3Y^ 



1- 



1— ac2 

C2 



< 1- 



l—ac2 
C2 



K 



j=0 
l-ac2 J ^ l-ac2 — *""• 



The proof follows by taking K given by (j22p . 

Let us now comment on several aspects of the above result: 



D 



1. Usage. We will use Theorem [2] to finish the proofs of the complexity results in Section [H with 
Cfc = v{^k) '■= F{xk) ~ F* ^ where {x^} is the random process generated by ICD. 

2. Monotonicity and Nonnegativity. Note that the monotonicity assumption in Theorem [2] is 
for the choice of x^ and (p described in 1) satisfied due to (j20p . Nonnegativity is satisfied 
automatically since F{xii.) > F* for all x^. 



3. Best of two. In (|3ip . we notice that the first term applies when o" > only. If ct = 0, then 
u = 0, and subsequently the second term in (|3ip applies, which corresponds to the exact case. 
Notice that if a > is very small (so u ^ 0), the iteration complexity result still may be 
better if the second term is used. 

4. Generalization. Note that for a = /? = 0, (j2ip recovers — (1 + log-) + 2 — 9-, which is the 
result proved in Theorem l(i) in [28], while ()22p recovers C2log((F(xo) — F )/ep), which is 
the result proved in Theorem l(ii) in [28j. Since the last term in (12 ip is negative, the theorem 
holds also if we ignore it. This is what we have done, for simplicity, in Table [TJ 

5. High accuracy with high probability. In the exact case, the iteration complexity results hold 
for any error tolerance e > and confidence p € (0, 1). However, in the inexact case, there are 
restrictions on the choice of p and e for which we can guarantee the result V{F{xk)—F* < e) > 
1 — /?. Table [5] gives conditions on a and /3 under which arbitrary confidence level (i.e., small 
p) and accuracy (i.e., small e) is achievable. For instance, if Theorem l(ii) is used, then one 
can achieve arbitrary accuracy only if /3 = 0, but arbitrary confidence under no assumptions 
on a and /3. The situation with part (i) is worse: e is lower bounded by a positive expression 
that involves />, unless a = /3 = 0. 





Theorem El^i) 


Theorem [2]^ii) 


e can be arbitrarily small if 


a = /3 = 


/3 = 


p can be arbitrarily smah if 


/3 = 


any a, (3 



Table 2: The conditions under which arbitrary confidence p and accuracy e are attainable. 



6. Two lower bounds on e. The inequality e > ^ ( a + \ cP' H — ^ 1 (see part (i) of Theorem [2]) 



is equivalent to e > 



Note the similarity of the last expression and the lower bound 



p{e-oci)' 

on e in part (ii) of the theorem. We can see that the lower bound on e is smaller (and hence, 
is less restrictive) in (ii) than in (i), provided that ci = C2. 
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7. Two analyses. It can be seen that analyzing the "shifted" form (j28p leads to a better result 
than analyzing (I26p directly, even when /3 = 0. Consider the case /3 = 0, so that a = a and 
u = ac\. From equation ([29]) 0a,.+i < ^ := ac\ + (1 — a)/(g-3^j^ + i")' whereas analyzing 
equation (p6]) directly yields 0^+1 < ^ := (1 + Oi)/{-^ + ^). It can be shown that A < B, 
with equality if a = 0. 

4 Complexity Analysis: Convex Composite Objective 

The following function plays a central role in our analysis: 

H{x,T) := f{x) + (V/(x),r) + \\\T\\} + ^{x + T). (35) 

Comparing ([35]) with ([IHD using ([2]), (l3|), (l6|), ([8]) and dll) we get 

n 

H{x,T) = f{x) + Y,yi{x,T^'^). (36) 

It will be useful to establish inequalities relating H evaluated at the vector of exact updates 
Tq{x) and H evaluated at the vector of inexact updates Ts{x). 

Lemma 3. For all x G M^ and 5 € M" , 

n 

H{x,To{x))<H{x,Ts{x))<H{x,To{x)) + ^6^'^. (37) 

Proof: 

n n 

H{x,To{x)) ^ fix) + ^V,{x,Tl^'\x))^ fix) + ^mmViix,t) 

2=1 1=1 

n 

< f{x) + Y,V^ix,4\x))^ Hix,Ts{x)) 



i=l 

n 



< /(x) + y f5» + minV-(x,t)) ^i7(x,To(x)) + y5«. D 
i=\ i=\ 

The following Lemma provides an upper bound on the expected distance between the current 
and optimal objective value in terms of the function H. 

Lemma 4. For x,T G M^, let xj^{x,T) be the random vector equal to x + UiT^^' with probability 
- for each i € {1, 2, . . . , n}. Then 

E[Fix+{x,T)) -F*\x]< l{H{x,T) - F*) + ^{F{x) - F*). 
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Proof: 

n 

E[F(x+(x,r)) |x] = ^iF(x + C/,r») 

n 

®±^ ^i^(x,T)+i^/(.)+ij:j:^,(.o)) 

lH{x,T) + r^F{x). D 

Note that if x = a:^ and T = Ts{xk), then x+(a;,T) = x^+i, as produced by Algorithm [TJ The 
fohowing Lemma, which provides an upper bound on H, wih be used repeatedly throughout the 
remainder of this paper. 

Lemma 5. For all x € doniF and 6 € M" (letting A = ^- 6^^' ), we have 

H{x,Ts{x)) < A + min \F{y) + '-^^\\y - x\\f} . (38) 

Proof: 

1(33 

H{x,Ts{x)) < A+ mm H{x,T) 

= A + min H(x, y — x) (where y = x + T) 

^ A+ mm{f{x) + {\/f{x),y-x) + '^{y) + ^\\y-x\\f} 



COJ 



< A+ mm {f{y)-^\\y -x\\f + ^iy) + Uy-x\\f} . D 



4.1 Convex case 

Now we need to estimate H{x, Ts{x)) — F* from above in terms of F{x) — F* . 

Lemma 6. Fix x* e X* , x e domF, 5 € M" and let R = \\x — x*\\i and A = Y^iS^^\ Then 

* . U'^-^^^^)iF{x)-F*), ifF{x)-F*<R^, , , 

R{x,Ts(x))-F* < A^{\ , 2R^ ^^ ^' !^ J y I - > 3g 

^ ' '"■ " - |iij2< i(^(^)_^*)^ otherwise. ^ ' 

Proof: Because strong convexity is not assumed, /i/(Z) = 0, so 

l l38t 
H{x,Ts{x)) < A+mm{F{y) + \\\y-x\\'i} 



< A + min {F{\x* + (1 - A)^) + ^ llx - x* 11?} 

AG[0,1] ^ / 2 n 

< A + min {Fix) - X(F(x) - F*) + ^rH. 
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Minimizing in A gives A* = min{l, {F(x) — F*)/R^] and the result follows. D 

We now state the main complexity result of this section, which gives the number of iterations 
sufficient for ICD used with uniform probabilities to push the value of the objective within e of the 
optimal value with probability at least 1 — p. 

Theorem 7. Choose an initial point xq € M and let {xk}k>o be the random, iterates generated 
by ICD applied to problem ([1]), using uniform probabilities Pi = - and inexactness parameters 

5^ , . . . , 5^" > that satisfy (fTOl) for a, p > 0. Choose target confidence p G (0,1) and error 
tolerance e > so that one of the following two conditions hold: 



(i) ^{a+Ja^ + ^) < e < F{xo)-F* anda'^+^ < 1, where d = 2nmax{nf{xo),F{xo)-F*}, 

(ii) -j^^T^ < e < ^lin{7^2(xo),F(xo) - F*}, where C2 = ?I^^ and ac2 < 1. 

// (i) holds and we choose K as in ([2T]) . or if (ii) holds and we choose K as in ({22]) . then P(F(xic) — 
F* <e)>l- p. 

Proof. Since F{xk) < -^(2:0) for all k by ([20l) . we have \\xk — x*\\i < TZi{xq) for all k and x* € X*. 
Using Lemma H] and Lemma [6l we have 

mk+i\xk] < h + i max {1 - 211^/4*11^ ^l}^k + ^ik (40) 

= 4 + max{l-2^^^^,l-2^}6 
< 4 + max{l-^^^,l-2^}efc, (41) 

where ^^ := F[xk) — F* . Consider case (i). From (j4ip and ()19p we obtain 

Efe+i I Xk] < 4 + (1 - |)efc < (1 + a)^k - I + /5, (42) 

and the result follows by applying Theorem [2]^i). Now consider case (ii). Notice that if S,k ^ e, then 
[1]) together with ([T9|), imply that 



EKA:+i|^fc]<4 + max{l-2^^|^,l-2^}efc<(l + a-^)efc + /3. 
The result follows by applying Theorem [2]^ ii) . D 

4.2 Strongly convex case 

Let us start with an auxiliary result. 

Lemma 8. Let F be strongly convex with respect to \\ ■ \\i with Pf{l) + /-f<i'(/) > 0. Then for all 
X € domF and 5 G M", with A = X^j(5'*s we have 

H{x,Ts{x))-F* < A+ (i^) {F{x)-F*). 



13 



Proof: Let fij = iJ,f{l), fJ-^ = /U<ji(/) and A* = (nf + /iit)(l + /x*) < 1. Then using ([38|) we can 
upper-bound H{x,Ts{x)) as follows: 



A+ mm{F{y)+^-^\\y-x\\f} 



< A + mmjF{Xx* + (1 - A)x) + ^^^^^\\x - x*\\f} 



AG[0,1] 



GD+Gl 



{l-^if)\^-{^^f+^J,^)\{l-X) ^^^ _ ^..2^ 



< A + min {AF* + (1 - \)F(x) + viz^lii^^u;iZ^l£Z^iiZ:v lU - x*\f} 

< A + F(x) -A*(F(x) -F*). 

The last inequality follows from the fact that {fif + ^ii')(l — A*) — (1 — fJ'f)^* = 0. It remains to 
subtract F* from both sides of the final inequality. D 

We can now estimate the number of iterations needed to push a strongly convex objective F 
within e of the optimal value with high probability. 

Theorem 9. Let F be strongly convex with respect to the norm \\ ■ \\i with fJ.f{l) + fJ-^il) > and 
let /i := iT—^-m-^- Choose an initial point xq e M^ and let {xk}k>o, be the random iterates 
generated by ICD applied to problem ([T]), used with uniform probabilities Pi = - for i = 1, 2, . . . ,n 
and inexactness parameters 5^ , . . . ,5^ > satisfying P^ . for < a < ^ and /3 > 0. Choose 
confidence level p € (0, 1) and error tolerance e satisfying , _ — r < e < F{xq) — F* . Then for K 
given by ^, we have ¥{F{xk) - F* < e) > 1 - p. 

Proof. Letting ^^ = F{xk) — F*, we have 

(Lemma |4j 

E[a+i I Xk] < l{H{xk,Ts,{xk)) - F*) + n^ik 

(Lemma [Sj 

n \^l+^*(0' 

By (fT2]) , /i < 1 , and the result follows from Theorem [2][ii) with C2 = ^ > 1 . D 



< " 4 + i f SfS^e.) + ^ik 



M 



5 Complexity Analysis: Smooth Objective 

In this section we provide simplified iteration complexity results when the objective function is 
smooth (^ = so -F = /). Furthermore, we will present complexity results for arbitrary (rather 
than uniform) probabilities pi> Q. 

5.1 Convex case 

In the smooth exact case, we can write down a closed-form expression for the update: 

To«(x) ^ argminy,(x,t) # argmin{(V./(x), t) + 'i\\t\\\^} = -}-B7^VJ{x). 
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Substituting this into Vi{x,-) yields 

V,ix,4\x)) = {VJ{x),4\x)) + ^i\\4\x)\\l^ = -^(||V./(x)||^,))2. (43) 

We can now estimate the decrease in / during one iteration of ICD: 

/(x + (/,rW(x))-/w < (Vi/(x).T«(i)) + |||r«(i)||^,i 

GHJ .■^ r\ 

< mm{{),6^'> +Viix,T^\x))} 

W ^in{o,5«-^(||VJ(x)||^,))'}. (44) 

The main iteration complexity result of this section can be now established. 

Theorem 10. Choose an initial point xq S M and let {xk}k>o be the random iterates generated by 
ICD applied to the problem of minimizing f, used with probabilities pi, . . . ,pn > and inexactness 
parameters 6f, ,. . . ^5^ > satisfying (fT9]) for a,/3 > 0, where a^ + -^ < 1 and ci = 2TZ^ i{xq) . 

Choose target confidence p € (0, 1), error tolerance e satisfying ^-{a + J a^ + -^) < e < f{xQ) — f*, 



and let the iteration counter K be given by (|21|) . Then P(/(x/<) — f* < e) >1 — p. 

Proof. We first estimate the expected decrease of the objective function during one iteration of the 
method: 

n 

E[/(:rfc+i) I Xk\ = f{xk) + J2pi[f{xk + UiTf^{xk)) - f{xk)] 

i=l 
® • ■ — MO 1 

Pi [c"- 

i=\ 

1 = 1 

< /(xfc)-i(||V/(xfc)||,V0' + "(/(^fc)-/*) + /3- (45) 

Since /(x^) < /(xq) for all A;, 

/(xfc) - r < max (V/(xfc),Xfc - x*) < ||V/(xfc)||* l7^;p-l(xo). (46) 

x*£X* ^ 



i=\ 

n 

/(x,)-i(||v/(x,)||,;-0' + E^4' 



Substituting (j46j) into (j45|) we obtain 

E[/(xfc+i) - /* I Xfc] < /(x,) - /* - 1 [^^^ ' + «(/(^;t) - /*) + /3. (47) 

It remains to apply Theorem [2|^i) . D 
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5.2 Strongly convex case 

In this section we assume that / is strongly convex with respect to || • ||/p-i with convexity parameter 
fif{lp~^). Using ([To]) with x = Xk and y = x* , and letting h = x* — x/^, we obtain 

r-fixk) > {^f{xk),h) + >^i^\\h\\i-^ 

= f^fi^p''){i]ijik^)'^fi^k),h)+mip-^)- (48) 

By minimizing the right hand side of (|48p . and rearranging, we obtain 

fi^k) - r < ^^^(|^_i) (l|V/(xfc)||,;-0^. (49) 

We can now give an efficiency estimate for the case of a strongly convex objective. 

Theorem 11. Let f be strongly convex with respect to the norm \\ ■ \\ip-i with convexity parameter 
fif{lp~^) > 0. Choose an initial point xq € M and let {xk}k>o be the random iterates generated by 
ICD applied to the problem of minimizing f , used with probabilities pi, . . . ,pn > and inexactness 
parameters 6j, , . . . ,6j!^ > that satisfy (|19p for < a < lJ'f{lp~^) and /5 > 0. Choose the 
target confidence p G (0,1), let the target accuracy e satisfy -? — ,^ _l^^^ < e < /(xg) — /*, let 
C2 = l/fif{lp~^) and let iteration counter K be as in ()22p . Then ¥{f{xK) — /* < e) > 1 — P- 

Proof. The expected decrease of the objective function during one iteration of the method can be 
estimated as follows: 

63 1 o 

]E[/(x,+i) - f*\xk] < (1 + a){f{xk) - D - 2(l|V/(^;^.)ll/;-i)' + /? 

< (l + a-^^(Zp-i))(/(xfc)-r) + /3 
It remains to apply Theorem [2)^ii) with (p{xk) = f{xk) — f* (and notice that C2 > 1 by ()12p ). D 

6 Inexact Updates in the Quadratic Case 

The goal of the second part of this paper is to demonstrate the importance of employing an inexact 
update in the block coordinate descent method through the use of a specific example: the special 
case when / is quadratic, ^ = 0, and the problem exhibits block angular structure. We motivate 
this choice now. 

Suppose that ^ = 0, so the function F{x) = f[x) is smooth and convex. Then the overapprox- 
imation dTj) becomes 

F{x + Uit) = f{x + Uit) < f{x) + {Vif{x), t) + ^{B,t,t). (50) 

At every iteration of the block coordinate descent method, the update t is found by determining the 
minimizer of the overapproximation (j50p , and this is equivalent to solving the system of equations 

B,t = -^Vifix). (51) 
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Clearly, solving systems of equations is central to the block coordinate descent method in the 
smooth case because a system of the form ()5ip must be solved at each iteration to determine the 
update to apply to the ith block. Furthermore, minimizing a quadratic function is equivalent to 
solving a system of equations, which is why it is a natural choice to study the case ^ = and / 
quadratic. 

Further, this method is designed to accommodate blocks of data. Matrices with block structure 
arise in many areas, and often the structure is dictated by the application under consideration. In 
particular, there is a broad class of problems that involve matrices with block angular structure, 
originating from loosely coupled systems and producing nearly separable structures; see for example 
Dantzig [6]. Matrices with this structure commonly arise in optimization, from optimal control, 
scheduling and planning problems to stochastic optimization problems, and exploiting this structure 
is an active area of research [5l [T0\ [33] . This important example motivates the work here and we 
demonstrate that our theoretical developments provide a framework to specialize first order methods 
for such problems. 

6.1 Problem setup 

Suppose we have the following unconstrained quadratic minimization problem 

uimf{x) = ^\\Ax-bg (52) 



where A e R^""^ , b e R^^ and x e R^ . Recah that Ui is defined in Section O so that Ai = AUi, 
and consider the following 

f{x + Uit) = ^\\A{x + Uit)-b\\l 

= ipx - bg + {UlA^iAx - b),t) + ^\\AU,t\\l 

= f{x) + {Vifix),t) + ^{A[Ait,t). (53) 



Comparing (j53|) with (|50|) . we see that in the quadratic case (|53|) is an exact upper bound on 
f{x + Uit) if we choose Lj = 1 and Bj = AjAi for all blocks i = 1, . . . ,n. The matrix Bi is required 
to be (strictly) positive definite so Ai is assumed to have full (column) ranko 
Now, substituting Lj = 1 and Bi = AfAi into (j5ip gives 

AfAit = -Af{Ax-b). (54) 



Therefore, at each iteration of ICD applied to problem (I52p . the update t is found by solving the 
system (j54p . The user defined matrix Bi is positive definite, (it must be so because it defines 
a norm; recall Section [2711) . so the exact solution to (fST]) is (Tq =)t = —jj-B^^ Vif{x) and a 
standard approach to solving (j5ip is to form the Cholesky factors of Bi followed by two triangular 
solves. However, armed with the iteration complexity results for ICD presented in the first part 
of this work, it is now possible to apply an iterative technique to ([5T]) to find an inexact update 



*If a block Ai does not have full column rank then we simply adjust our choice of Li and Bi accordingly, although 
this means that we have an overapproximation to f{x + Uit), rather than equality as in H53[l. 



17 



{Tg (x) =)t. Because Bi is positive definite, a natural choice in the inexact case is to solve the 
system (jSlj) using conjugate gradients [El [35], and this is the method we adopt in what follows. 

It is expected that an iterative technique will be faster than a direct method, so the update t 
can be determined more quickly, and subsequently the overall algorithm time reduces. 

6.2 Block angular structure 

Now suppose that A G R^-'^^^ has block angular structure. Define 

" C " 



A 



D 



(55) 



where the columns of A are partitioned into n blocks 

c 



Co 



rtmxN 



(56) 



and 



D = [Di D2 



^n 



D„\ G 



r,exN 



(57) 



Furthermore, assume that each block Cj has size Mj x Ni, and the linking blocks Di have size 
i X Ni, respectively. We assume that £ < Ni and subsequently £ <^ N and that there are n blocks 



with m 



-.iNi. 



J21^^ Mi so M = m + £, and N = J2Z 

Remark: Notice that if Z) = 0, where is the £ x N matrix of all zeros, then problem ()52p is 
completely (block) separable so it can be solved easily. The linking constraints D make problem 
(j52p nonseparable, which makes it difficult to solve. 

A block of columns A- has the form 



Ar 



a 



D, 



sMxNi 



(58) 



so 

Bi=A[A^ = CfCi + DfD^, 

and we can rewrite the system ()54p as 

(Cf Q + DfD^)t = -Aj{Ax - b). 



(59) 
(60) 



The system of equations ([5U|) must be solved at each iteration of ICD because it determines the 
update to apply to the ith block. As previously mentioned, the system ([60]) can be solved inexactly 
using an iterative method and we use the conjugate gradients method in the numerical experiments 
presented in Section [8| 
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6.3 Preconditioning the update step 



It is important that (j60p be solved quickly, and one way to speed up an iterative method is to 
apply a preconditioner. It is commonly accepted that the conjugate gradient method often needs a 
good preconditioner to enable fast convergence and finding effective preconditioners for conjugate 
gradients is an active area of research; see for example [2l [9l 111] , 

In general, a sensible choice for a preconditioner is one that approximates the Hessian, in this 
case Bi = AjAi. We propose the preconditioner (for the ith system) 

V^ := Cla. (61) 

If Mi > Ni and rank(Ci) = Ni, then the block CfCi is positive definite and therefore is 
nonsingular, so the preconditioner (|6ip is also nonsingular. Applying V^' to Bi gives: 

M^ := V^^Bi = Pri {Vi + DjDi) = 1 + Vr^DfDi. (62) 



However, if Mi < Ni then Vi defined in (I6ip is rank deficient and is therefore singular. To 
remedy this, when Mi < Ni we perturb (I6ip by adding a positive multiple of the identity matrix, 
and propose the nonsingular preconditioner 



Vi = V^ + pI = CfCi + pi, (63) 

Mi = -p-^Bi = V-^V^ + V-^DfD^, (64) 

-pr^n = {cfa + pir'cTa. (65) 



where p > 0. We have 
where 



Applying the preconditioners (defined in ([6T]) for Mi > Ni, and (j63|) for Mi < Ni) to ([60]) . 
should result in the system having better spectral properties than the original, and this will lead 
to faster convergence of the conjugate gradient algorithm. Specifically, the preconditioner should 
shift the spectrum so that the eigenvalues of the preconditioned system are clustered around one, 
with few outliers. Studying the eigenvalues of the preconditioned matrix is the topic of the next 
section. 

7 Eigenvalues of the preconditioned matrix 

To investigate the quality of a preconditioner, we study the eigenvalues of the preconditioned 
matrices Mi and Mi defined in (j62p and (j64p . respectively. We will make use of the following 
simple result. 

Theorem 12 (Theorem 2.8 in [S]). Let A and B bemxn andnxm complex matrices, respectively. 
Then AB and BA have the same nonzero eigenvalues, counting multiplicity. 

Therefore, the nonzero eigenvalues of the Ni x Ni matrix P^^ DfDi are the same as the nonzero 
eigenvalues of DiV~ DJ . We prefer to work with DiV~ DJ because it is symmetric and positive 
semidefinite, so it has real, nonnegative eigenvalues. (Furthermore, if Di has full (row) rank, then 
DiV~ DJ is positive definite.) 
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Lemma 13. Let n = rank(A) and ri < Ni. Then V^^DjO-i € M^'>^^' has 
(i) Tj eigenvalues that are strictly positive, 
(a) Ni — ri eigenvalues equal to zero. 
Consequently, we have the fohowing result. 
Theorem 14. Let ri = rank(A) and r^ < Ni. Then Mi = L + V~'^DfDi has 
(i) ri eigenvalues that are strictly greater than one. 
(a) N — ri eigenvalues equal to one. 

We can say more about the eigenvalues of Mi by considering the blocks of A and investigating 
the relationship between the matrices C and D, defined in (j56p and ()57p . respectively. (Note 
that the eigenvalues of V~ DJ Di can be determined exactly by solving the generalized eigenvalue 
problem DJ DiV = XPiV.) 

Recall that the blocks along the diagonal of C are Cj S "^^'h^-Ni^ The remainder of this section 
is broken into two parts. The first part considers the case when Mi > Ni while the second part 
considers the case when Mi < Ni. In each case Cj is assumed to have full rank. 

7.1 Skyscraper shaped blocks 

Here it is assumed that the blocks Cj have size Mi x Ni where Mj > Ni, and that Ni = rank(Cj) 
(i.e., Ci has full column rank). 

The preconditioner (I6ip is applied to the system (j60p and we are interested in the distribution of 
the eigenvalues of the preconditioned matrix Mi = V~ Bi. Subsequently, we study the eigenvalues 
oiVr^DjD,. 

We consider the general case when the matrix Di G M^^^» with 1 < I < Ni. The matrix Ci is 
assumed to have full rank, so the rows of Ci contain a basis for M ' . Subsequently, each row in the 
linking matrix Di is a linear combination of the rows of Ci\ i.e., for Zi G M^^^'^j we can write 

Di = ZiC. (66) 

Lemma 15. Let Vi G M^»^^» and Di G M^^^^ he the matrices defined in (f6T]) and (f66l) respectively 
and let zj denote the jth row of Zi. Let Ci = YiRi denote the thin QR factorization of Ci, so 
Yi G ]R^^'^^» has orthonormal columns and Ri G M^»^^» is upper triangular JEJi- Then 

e 
trace(A^r'A^) = ^W^^.g < \\Zifp. (67) 

3=1 

Proof. The trace is simply the sum of the diagonal entries of a (square) matrix, so consider the 
diagonal elements of DiV^^Dj . 

DiVr^D, = ZiCi{ClCi)-^Cfzl = ZiY,Ri{RfY^Y^Ri)-^R[Y;^Zl = {Z,Yi){ZiYif 

The jth diagonal element of DiV^ Df can be written as 

{D,Vr^Df)j, = \\Y^z,\\l 
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Furthermore, ||.^j|||' = X^j=i IkjUi- Because YiY^ is a projection matrix, 

llv^-y-ll-^ — WV-v'^-y-W'^ <: ll'y-ll^ 
and the result fohows. D 



Remark: When Cj is square and has full rank, Yi is an orthogonal matrix, and subsequently 

^ llz-||2 



tv3.ce{D^Vr'Df) = ZU W^M = \\Z^\\l 



Theorem 16. Suppose that the matrix A G ^*^x^ Jiag primal block angular structure, with rect- 
angular blocks Ci € M '^ ' (Mi > Ni) of full rank (N-i = rank {Ci)) along the diagonal. Suppose 
that Bi, Di and Vi are defined in ([59]) . ([66]) and ([6T]) respectively, and let ri = rank(Dj) where 
fi ^ Ni. Then V~ Bi has 

(i) Ni — Tj eigenvalues equal to one, 
(ii) ri eigenvalues that are strictly greater than 1, and sum to ri + X^,-^]^ 11^"^ -^j 111- 

7.2 Warehouse shaped blocks 

Now it is assumed that the blocks Ci have size Mj x Ni where Mi < Ni, and that each block has 
full (row) rank, Mi = rank(Ci). 

For the block coordinate descent method, the matrix Bi = AjAi must have full rank because 
it defines a norm (see Section [2. ip . However, when Ci is warehouse-shaped, Vi defined in (I6ip is 
rank deficient so we use the preconditioner Pi defined in (j63p . 

Recall that in this case, the preconditioned matrix is defined in (I64p . We study the eigenvalues 
of 'P~ Vi and V^ DfDi separately, before stating the main result of this section, which describes 
the eigenvalues of Aii. 

We begin by describing the eigenvalues of V^ Vi ()65p . 



Theorem 17. Let Ci be a real Mi x Ni matrix with Mi < Ni and full row rank Mi = rank(Cj). 
Let Vi and Vi be defined in (I6ip and (I63p respectively, and let Mi = rank(7^j). Then V^ Vi has 
Ni — Mi zero eigenvalues and Mi positive eigenvalues that tend to 1 as p —^ 0. 

Proof. The matrix Vi has full rank so 

Tank{V^^Vi) = Tank{Vi) = Mi. 

Therefore, V~ Vi has Mi nonzero eigenvalues and Ni — Mi zero eigenvalues. Furthermore, the Mi 

nonzero eigenvalues are positive. Indeed, V[' and Vi are both symmetric positive semidefinite 

matrices, and by Theorem \T2\ the nonzero eigenvalues of V^ Vi are the same as the nonzero 

^ _ 1 ^ _i 
eigenvalues of Vi ^ ViVi ^ . The latter matrix is clearly positive semidefinite so its eigenvalues are 

nonnegative. 

Let Ci = UTjV^ denote the singular value decomposition of Ci, and let Ai, . . . , Aj./^ denote the 

Mi nonzero eigenvalues of Vi. Then we can write Vi = CfCi = VKV , where 



A = S^S 



Ai 



and Ai 
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-^M,, 



and denotes the {Ni — Mi) x (Ni — Mi) matrix of all zeros. 
Define the matrix Ai € R^^i^^'^^ and its inverse as follows: 



Ai =Ai 



Pl 



Ai + p 



Am, + P_ 



a; 



Ai+p 



>^Mi+P. 



(68) 



Now consider the preconditioner Vi = Vi + pi, which has the following singular value decom- 
position: 

V, = Vi + pI = VkV^ and p'^ = {Vi + pl)'^ = Vk-^V^ (69) 



where 



A 



Ai 



Pl. 



and A 



-1 



Ar^ 



^^ 



and Ai and A^ are defined in (j68p . Then 



V7^Vi = vk-^v^vAv^ 



V 



and as p — 7- 



' ^j+P 



1 for j = 1, 



V 



Mi. 



Ai+p 


1 






Aj\fj+P 


5^ 


Ai 
Ai+p 


Aa/j 






Ailfj+P 










Am, 



y^ 



y^ 



D 



Remark: This is a very useful result because it also shows that "P^^ Vi is a symmetric matrix. 
(We know that, in general, the product of two symmetric matrices does not have to be symmetric. 
However, this result shows that, for a general matrix A, if ^ is a symmetric matrix then [A+pI)~^A 
is also symmetric.) 

In what follows, we make use of the following simple idea. We assume that the matrix Cj has 
full (row) rank, so the rows of Cj form a basis for a subspace W = span{c^ , • • • , c]^*/, } C M ' , where 

{Cj )'^ is the jth row of Cj. Let W"*" denote the orthogonal complement of W. Any vector v € 
can be expressed as 



_L 



V = w + w , 



where w; G W, w^ ^ W^ . (70) 

Recall that the blocks Cj G ^^r^^t are warehouse-shaped (Mj < Ni) and have full (row) rank: 
Mi = rank(Cj). Suppose that the matrix A G M^^^' where £ > Ni - Mi and that Ni = rank(^j) 
to ensure that Bi has full rank. Furthermore, let VF be an ^ x A'j matrix whose rows wj € W, for 
j = !,...,£, and let W^ be an £ x Ni matrix whose rows (w^)'^ G W-*-, for j = !,...,£. Then one 
can write 

Di = W + W^. (71) 
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Recall the preconditioned matrix Aii in ()64p and notice that it remains to study the eigenvalues of 

By Theorem \T2\ the nonzero eigenvalues of "P^ Dj Di are equivalent to the eigenvalues of 
DiP~ Df and we prefer to work with this small, symmetric, positive definite matrix (Dj has full 
row rank by assumption). Now we study the diagonal elements of DiP^ Df . 

{D,Vr^Df)j, = (w, + wffvr\wj + wf) 

= wJVr^Wj + 2{wffV-'wj + {wffVr\wf). (72) 

Recall the eigenvalue decomposition (|69p and let V = \Vi V2] be a partitioning of V, where 
Vi € R^i^^i^ and V2 G I{^ixiNi-M^) ^ ^^^ consider the first term on the right hand side of ([72]) : 



wJp-^Wj = wJVA-^V^Wj = [ wJVi I wJV2 



[Ar^ 










^^1 



The vector u; is a linear combination of the rows of Ci and the columns of V2 form a basis for the 
null space of Ci , so 



wJp-^Wj = [ wJVi I 



[A- 











p 



1t/T„ 



w'jViA^'V{wj. 



(73) 



This is important because - — > 00 as p — > 0, but because V2WJ = 0, none of the components in 
V^Wj grow too large, so neither does the term wJV~ Wj. Similarly 



«f^rS = [ HVv^ \ {wf)^v2 ] 



[Ar^ 











^^1 



{wffViA^W^Wj 



(74) 



so {wj~) Vj^ Wj also remains independent of the parameter p. However, the last term in (j72 



becomes 



{wffV-\wf) = [ {wffV, \ {wffV2 ] 



Ar 



p 



^IH) 



vi (wf) 



iwffV.A^'vnwf) + ^(wfrV2V2^{wf) 



(75) 



and as yO — > 0, {w^)'^V^ {wj-) -^ co. 

Combining (|73p . (j74p and (j75p we can rewrite (|72p in the following way 



iD,V~'Df),, = iw,+wffViA^W,^iw, + wf) + -iwffV2V,^iwf) 



1 

-I 

P 



\A^^V{{wj + w 



f)\\l 



+ -\\Vy^w 



P 



2 «'j 112- 



(76) 



This demonstrates that the choice of the parameter p is very important. There is a trade-off 
here: p should not be too small or (wj-) V^' {wj-) will become arbitrarily large, but a small value 
of p will lead to a good clustering of the eigenvalues around one (Theorem 1 17p . 

Before we say more about the eigenvalues of V~ DJ Di, we present the following result. 
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Theorem 18 (Theorem 4.3.1 in [l3]). Let A and B be N x N Hermitian matrices and let the 
eigenvalues Xi{A), Xi{B) and Xi{A + B) be arranged in increasing order (X^im = Xi < ■ ■ ■ < Xpf = 
'^maxj- For each k = 1,2, . . . ,N we have 

Xk{A) + Xi{B) < Xk{A + B)< Xk{A) + Xn{B). 

Now we state the main result of this section. 

Theorem 19. Let Ci be an Mi x Ni matrix with Mi < Ni and Mi = rank(Cj) and let Di be an 
I X Ni matrix with Vi = rank(Z?j). Let Vi be the preconditioner defined in ()63p and let Ai and Bi 
be defined in (|58p and (j59p respectively with Ni > Si = rank(j4j). Then A4i = V^ Bi has 

(i) Ni — Si eigenvalues equal to zero. 

(a) Si — ri eigenvalues in the interval (0, 1) 

(Hi) ri eigenvalues in the interval 



1,1 + ^ (||A7 Vi^K- + wf)\\i + -\\y^{wf)\\ 

A — ^ I 



Proof. Part (i) holds because Bi is Ni x Ni with rank(Sj) = rank(Aj) = Sj. Part (ii) follows from 
Theorem [T71 and Theorem [T21 For part (iii), notice that Ainax(-C'.(/P~ DJ) < trace(L'j'P~ L)f). Now 
using (|76p and Theorem 1181 gives the result. D 

Remark: For the ICD method, we require that rank(j4j) = Ni, because this ensures that Bi 
is a positive definite matrix. Notice that in this case, Theorem 1191 explains that all eigenvalues of 
A4i = V~ Bi are strictly greater than zero (i.e., Ni = Sj). 

Figure [T] shows the distribution of eigenvalues before and after application of the preconditioner. 
In the plot on the left the eigenvalues of both Bi and V~ Bi are shown, where Ci is 200 x 100 
(skyscraper-shaped) and D, is 10 x 100. In the plot on the right, the eigenvalues of both Bi 
and ■Pj" Bi are shown, where Ci is 450 x 500 (warehouse-shaped) and Di is 50 x 500 with p = 
10~^. In both cases the distribution of the eigenvalues is greatly improved after application of the 
preconditioner. The eigenvalues are clustered around one with few outliers. Further, in the right 
hand plot, notice that the largest eigenvalue is of the order ^ as expected from Theorem [T9l 

8 Numerical Experiments 

In this section we present small-scale preliminary numerical results to demonstrate the practical 
performance of Inexact Coordinate Descent applied to the problem described in Section [6Tl Specif- 
ically, we assume that the function F = f is quadratic, ^ = and the matrix A has block angular 
structure with n = 10 blocks. The vector x was constructed and b = Ax was computed and i.i.d 
noise (at a level of 1%) was added to the last i components of b. 

Each experiment (to be described shortly) was implemented in Matlab and was performed 
100 times, with the average result presented in the Tables [3] and HI 
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Number of eigenvalues 
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Distribution of eigenvalues 
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100 200 300 400 
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500 



Figure 1: Plots showing the distribution of eigenvalues before and after preconditioning. In the 
left plot the matrix Ci is 200 x 100 (skyscraper-shaped) and Di is 10 x 100, while in the right plot 
Ci is 450 X 500 (warehouse-shaped) and Di is 50 x 500. The distribution of eigenvalues improves 
greatly after preconditioning with clustering around 1. 



b\\y\\b\\l < tol where 



tol' is a user defined 



The stopping criterion was chosen to be \\Ax 
tolerance that was set to tol= 10~^. 

Each of the matrices involved is sparse with the density of Ci set to approximately 10~^ Mi Ni + 
Ni, and the density of the linking constraints Di set to approximately 0.5 iNi. 

The purpose of each experiment was to study the use of an iterative technique (with and without 
preconditioning) to determine the update used at each iteration of the block coordinate descent 
method. 

In the first experiment each of the blocks Ci is skyscraper shaped with Mi = 1250, Ni = 1000 
and Di has 1, 10 or 100 rows. The results are shown in the Table [3l 



Table 3: Experiment 1: Skyscraper shaped blocks. All times and iteration counts are average over 
100 runs. 





i 


Time(s) 


Its 


Inner its 


PCG 


1 


2.8 


854.4 


2.2 




10 


12.5 


2136.8 


4.8 




100 


49.0 


2756.5 


8.2 


CG 


1 


4.6 


857.0 


4.6 




10 


18.4 


2146.3 


6.2 




100 


65.9 


2769.4 


6.9 


Direct 


1 


15.0 


853.3 


— 


solve 


10 


81.0 


2134.2 


— 




100 


135.5 


2754.2 


— 



In the second experiment the blocks Ci are warehouse shaped with Ni = 1000 and Mi and i 
varying (while ensuring that Mi + i > Ni and Ai has full rank so that Bi is positive definite). For 
the preconditioner (J63|l . p = W~^. The results are shown in the Tabled! 
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Table 4: Experiment 2: Warehouse shaped blocks. Times and iteration counts are average over 
100 runs. 





I 


Time 


Its 


Inner its 


Ikll^/I|fc||^ 


PCG 


2 


2.31 


1000 


1.37 


0.18x10-4 




50 


5.62 


1000 


1.01 


0.18x10-4 




100 


39.78 


1000 


1.07 


0.10xlO"4 


CG 


2 


2.28 


1000 


1.01 


0.21x10-4 




50 


7.20 


1000 


1 


0.48x10-4 




100 


42.37 


1000 


1.01 


0.11x10-4 


Direct 


2 


6.20 


1000 


— 


0.07x10-*^ 


solve 


50 


6.22 


1000 


— 


0.20x10-6 




100 


27.21 


1000 


— 


0.23x10-6 



9 Conclusion 

In this work we introduce Inexact Coordinate Descent, which is a block coordinate descent method 
that uses an inexact update. Iteration complexity results are presented to show that the algorithm 
is guaranteed to converge with high probability when applied to a convex composite function ([1]). 
The theoretical results were complemented by practical considerations in the second half of this 
paper. Because an inexact update is allowed, iterative techniques can be used to determine the 
update to apply at each iteration and the advantages were highlighted by studying the quadratic 
case where block angular structure was present. The numerical results presented at the end of this 
work strongly support ICD. 
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